Molecular characterization of Richter syndrome identifies de novo diffuse large B-cell lymphomas with poor prognosis

Richter syndrome (RS) is the transformation of chronic lymphocytic leukemia (CLL) into aggressive lymphoma, most commonly diffuse large B-cell lymphoma (DLBCL). We characterize 58 primary human RS samples by genome-wide DNA methylation and whole-transcriptome profiling. Our comprehensive approach determines RS DNA methylation profile and unravels a CLL epigenetic imprint, allowing CLL-RS clonal relationship assessment without the need of the initial CLL tumor DNA. DNA methylation- and transcriptomic-based classifiers were developed, and testing on landmark DLBCL datasets identifies a poor-prognosis, activated B-cell-like DLBCL subset in 111/1772 samples. The classification robustly identifies phenotypes very similar to RS with a specific genomic profile, accounting for 4.3-8.3% of de novo DLBCLs. In this work, RS multi-omics characterization determines oncogenic mechanisms, establishes a surrogate marker for CLL-RS clonal relationship, and provides a clinically relevant classifier for a subset of primary “RS-type DLBCL” with unfavorable prognosis.

termed Richter syndrome (RS) 3 . Diffuse large B cell lymphoma (DLBCL) subtype accounts for 90-95% of RS cases. Around 80% of RS cases are IGHV-unmutated while the remainder are IGHV-mutated 4 . In contrast, most de novo DLBCL (from now on called DLBCL) are IGHVmutated as they originate from germinal center (GC) or post-GC B cells. Based on gene expression patterns, different cell-of-origin (COO) derivations of DLBCL include GC B cell like (GCB) and activated B celllike (ABC) DLBCL 5 . Recent genomic studies combining DNA and RNA sequencing extended DLBCL subtyping beyond COO 6-10 , identifying DLBCL subgroups defined by their genomic alteration patterns and associated clinical courses, but a notable proportion remains unclassified 7,8,10 . Moreover, although studies have shown some extent of association of genetically defined groups with transcriptionally defined COO signatures, the transcriptome in its entirety is not fully used in current classifications.
As compared to other lymphoid malignancies, the availability of in vitro or in vivo models to study RS is limited [11][12][13][14][15] , and therefore our current knowledge on RS biology remains incomplete. The few genomic studies attempting to decipher oncogenic mechanisms underlying RS described disabled DNA damage response and cell cycle control through TP53 abnormalities and CDKN2A deletions, chronic B cell receptor (BCR) signaling, and NOTCH, MYC, and MAPK pathway deregulations [16][17][18][19][20] . A recent report using multiome and single cell approaches in sequential CLL-RS samples describes that the increased molecular complexity of RS does not seem to be the consequence of clonal evolution over time but rather the selection of minute subclones present at CLL diagnosis and years before overt transformation 21 . Additionally, recent studies focusing on DNA methylation (DNAm) further captured the genomic complexity of CLL [22][23][24][25][26] , RS 27 , de novo DLBCL [28][29][30] , and other B cell neoplasms [31][32][33] . A better understanding of epigenetic signatures is needed, whether related to B cell development or tumor transformation mechanisms.
Distinguishing between CLL-derived RS and de novo DLBCL in a diagnostic setting based on histology and immunochemistry alone is challenging. Around 80% of RS cases are clonally related to the CLL disease stage while the remainder are unrelated (i.e. independent de novo DLBCL). This dichotomy is of importance for treatment decisions. De novo DLBCLs are chemosensitive in most patients, whereas CLL-derived RS is mainly characterized by chemoresistance and poor outcome, with a median overall survival (OS) of around 12 months.
In this study, we perform genome-wide DNAm analysis and wholetranscriptome profiling for a large series of primary human RS samples, and comprehensively compare our findings to those in CLL and DLBCL. We extensively characterize the epigenetic architecture of the RS samples and find the majority retain a CLL imprint. Remarkably, applying DNAm-and gene expression-based classifiers to datasets from landmark studies identifies a subset of "RS-type" DLBCL that is not previously described at the genomic level, is enriched in cases with an ABC-like COO signature, and has an unfavorable prognosis 7,8,10 .

Data quality controls
The study workflow is described in Fig. 1. We investigated DNAm using array-based technologies, exploring a total of 433 samples, including 58 RS samples, 25 CLLs paired with RS (i.e. tumor samples were available at both CLL and RS stages; hereafter "paired-CLLs"), 68 DLBCLs, and additional published methylomes from 190 other CLLs, and 92 samples representing normal B cell subpopulations (Supplementary Fig. 1) 22,25,26,34,35 . Limiting the batch effect is critical for comparing large cohorts explored with different platforms in different facilities. In this regard, we used EPIC and 450K Illumina microarray platforms, as these provide accurate, robust, and reproducible genome-wide coverage of CpG sites 36,37 . We extensively explored potential batch effects and showed it was completely removed after applying strict quality  25 CLLs paired with RS (tumor DNA samples were available at both CLL and RS stages), 190 other CLLs, 68 de novo DLBCLs, and 92 samples from normal B cells spanning the entire B lineage. All 58 RS samples were also documented for mutations in a custom panel of 13 CLL driver genes, and RNA-sequencing data were concomitantly available for 41 RS samples, allowing integrative analyses and detailed exploration of oncogenic processes and epigenetic network deregulations. RNA sequencing data were obtained for another 6 RS, 28 de novo DLBCLs, and 10 non-tumoral lymph nodes. Data acquired from normal B cell control groups were used for methodologic purposes only (see "Methods"). CLL chronic lymphocytic leukemia, DLBCL de novo diffuse large B cell lymphoma, NGS nextgeneration sequencing, RS Richter syndrome. controls (see "Methods"). In addition, we applied a bioinformatic deconvolution method to separate methylation data attributable to five subtypes of normal white blood cells (CD4+ T-lymphocytes, CD8+ T-lymphocytes, neutrophils, monocytes, B cells). Use of respective cell composition data as covariates in supervised analyses limited the influence of tumor cell content of our samples. Next, we annotated CpGs differentially methylated between RS, CLL, and DLBCL according to 12 chromatin states reported in 7 CLL reference epigenomes 26 . The 102,614 CpGs differentially methylated between RS and CLL (two-way moderated t test adjusted for a false discovery rate (FDR) < 0.01; 90.8% hypomethylations in RS) were: depleted (ratio < 0.75) in active promoters, poised promoters, promoter-associated strong enhancers, and weak promoters; and enriched (ratio > 1.5) in transcription transition regions and heterochromatin (Fig. 2c). The 82,940 CpGs differentially methylated between RS and DLBCL (96.4% hypomethylations in RS) were: depleted in active promoters; and enriched in poised promoters and regions repressed by H3K27me3. Differentially methylated regions (DMRs; see "Methods") between RS and DLBCL were strongly enriched in targets of polycomb complex components SUZ12 (p = 1.2e−121) and EZH2 (p = 1.5e−30), which likely corresponds to the derivation of DLBCL from GC or post-GC B cells. Notably, genes associated with the extracellular matrix were overrepresented in this subset (Supplementary Fig. 9 and Supplementary Data 1). DMRs between RS and CLL were linked to NOTCH and Wnt pathways, and to the adaptive immune system, with PD-1 signaling and T cell/B cell co-stimulations ( Fig. 2d and Supplementary Data 2), which likely corresponds to the driver role of NOTCH and PD-1 signaling in RS onset.

DNA methylation separates CLL-derived and DLBCL-like RS subgroups
The PCA principal component 2 split the RS samples into two subgroups, one with a profile similar to CLL, the other closer to DLBCL (Fig. 2a). We postulated that "CLL-derived RS" (maintaining a CLL imprint) could be separated from "DLBCL-like RS" (distinct from the preceding CLL and closer to DLBCL). To test this, we modeled a linear predictor score (LPS) 38 , computing two underlying probabilities (p): one to label samples according to their CLL-derived RS profile (p CLL-derived ), one for DLBCL-like RS (p DLBCL-like ), defining p CLLderived ≥ 98% and p DLBCL-like ≥ 98% to obtain highly specific and homogeneous groups (see "Methods"; Supplementary Fig. 10). The statistical model devised to compute LPS was constructed with 4863 CpGs robust in separating CLL from DLBCL. Since de novo DLBCLs are usually IGHV-mutated whereas CLL may be IGHV-mutated or -unmutated, we excluded CpGs highly differential according to IGHV status 22,24 from the LPS calculation to focus on other distinctive features between CLL and DLBCL. The LPS scoring system was confirmed with hierarchical clustering (Fig. 2e), non-negative matrix factorization (NMF), PCA (Supplementary Figs. 11 and 12), and displayed differential patterns on normal cells spanning the B cell lineage ( Supplementary  Fig. 13). The scoring system identified 33 CLL-derived RS (57%) and 13 DLBCL-like RS (22%), leaving 12 intermediate samples (21%). This latter subgroup clustered within the CLL and CLL-derived RS branch, albeit marginally (Fig. 2e). The subgroup was then referred to as "low-LPS score CLL-derived RS" ( low CLL-derived RS), in contrast to the "high-LPS score CLL-derived RS" ( high CLL-derived RS). Comparing high CLLderived RS and DLBCL-like RS confirmed global hypomethylation of high CLL-derived RS. In addition, DLBCL-like RS genomic distribution of DNAm did not coincide with that of DLBCL, with most locations hypomethylated in DLBCL-like RS (Fig. 2f, g and Supplementary  Fig. 14). This subgrouping was not influenced by the tumor cell content ( Supplementary Fig. 10).

RS homogeneous subgrouping corroborates with gene expression
Among the 58 RS samples investigated for DNAm, 41 also underwent whole-transcriptome profiling. RNA samples from 6 independent RS cases were also sequenced. In total, the RNA-sequencing experiment included lymph node samples of 47 RS, 2 paired CLLs, and 28 DLBCLs, plus 10 non-tumoral samples for methodologic validation purposes (see "Methods"). Hierarchical clustering of the 23,508 identified genes confirmed clear subgrouping among RS samples (Fig. 3a). All RS classified as DLBCL-like RS by DNAm clustered with DLBCL (predominantly with non-GCB subtype) and separated from CLL-derived RS. This supports the existence of CLL-derived RS and DLBCL-like RS, through cross-validation using an orthogonal technique (>95% concordance). Annotations of gene clusters showed that CLL-derived RS shared a solid CLL gene expression signature, with upregulated genes involved in the BCR pathway and downregulated genes involved in the immune response, p53-signaling, and JAK-STAT pathways. Furthermore, K-means gene clustering of the 47 RS samples ranked according to LPS gradient revealed two main clusters of differentially expressed genes between high CLLderived and DLBCL-like RS (Fig. 3b). One cluster is downregulated in high CLL-derived RS, is related to the extracellular matrix and TLR signaling, and included methylation-regulated p53 activity as an interesting feature (Supplementary Data 3). The other cluster is reminiscent of a CLL signature, overexpressed in high CLL-derived RS, and linked with NOTCH, PI3K signaling, and DNAm metabolism (Supplementary Data 4).

RS subgroups correlate with IGHV mutational status and CLL-RS clonal relationship
To reduce the influence of IGHV mutational status on LPS, CpGs highly differential between U-CLL and M-CLL were filtered from the scoring CpGs. However, IGHV mutational status is associated with major DNAm changes in CLL 22,24,34 . Therefore, we next performed PCA on the 10,000 most variable CpGs, whether associated or not with GC reaction, tagging samples with IGHV annotations (Fig. 3c). CLL-derived RS accounted for nearly 80% of our RS samples and displayed a high ; RS versus DLBCL (n = 68): p = 6.07e−12. c Distribution of differential CpGs (FDR < 0.01; methylation differential >10%) according to the reported chromatin states in 7 CLL reference epigenomes 26 . Enrichments are shown as a heatmap and were calculated from the position of the selected CpGs. Their distribution was reported among 12 different chromatin state categories. Barplots in the right part of each panel show the methylation status difference in RS versus CLL or DLBCL. Differentially methylated CpGs are distributed among 3 methylation level categories. Upward bars indicate a comparative gain of CpGs in RS for the corresponding category, while downward bars indicate a comparative loss in RS. d RS versus CLL top annotations network (ReactomePA) from 238 differential DMRs computed with DMRcate (Fisher's multiple comparison statistics: min_smoothed_FDR and HMFDR both <0.01; max beta-value differential >30%; at least 3 CpGs in the DMR with no gap >1 kb between CpGs). e DNAm-based linear predictor score (LPS) CpG architecture. Hierarchical clustering of 4863 CpGs differential between CLL and DLBCL (FDR < 0.01; beta-value differential >30%; moderated t test). f Density map of DNAm between high CLL-derived and DLBCL-like RS. Smoothed beta-value densities from the EPIC dataset. Scale from blue (no density) to yellow (medium density) and red (high density). g Boxplots showing general methylation levels for high CLL-derived (n = 33), low CLL-derived (n = 12), and DLBCL-like RS (n = 13), de novo DLBCLs (n = 68), and CLLs (n = 215 prevalence of IGHV-unmutated samples. In contrast, 12/13 (93%) DLBCL-like RS were IGHV-mutated. RS subgrouping was thus highly associated with IGHV mutational status (p = 6.3e−9). This raises the possibility that RS subgroup partitioning simply reflects DNAm patterns of U-CLL and M-CLL. However, while most CLL-derived RS samples gathered among U-CLL, DLBCL-like RS samples regrouped with DLBCL, well separated from M-CLL (Fig. 3c).
Moreover, none of the DLBCL-like RS were clonally related to their respective CLL component (n = 5 pairs), confirming that DLBCL-like RS were not M-CLL-derived RS but rather de novo DLBCLs. In contrast, CLL epigenetic imprint is a feature of CLL-derived RS, likely an entity arising from CLL cells ( Supplementary Fig. 15). This CLL-RS clonal relationship was further confirmed by identical IGHV-CDR3 sequences found in paired CLL and RS samples (n = 26 pairs; p = 5.8e−6). To confirm the ability of the LPS to identify CLL-derived RS, we set up an independent validation EPIC 850 K experiment, investigating 52 samples (see "Methods" and Supplementary Fig. 16 Supplementary Fig. 16). These findings clearly indicate that DNAm is a powerful tool to determine the cellular origin in cases diagnosed as RS, as it differentiates DLBCL arising in a patient with CLL from true morphological transformations of CLL.
To further characterize our RS samples, we sequenced a panel of 13 CLL driver genes. Data integrated with copy number variations obtained from DNAm showed a high prevalence of CLL-driver mutations in RS samples harboring a CLL methylation signature (Supplementary Fig. 17). CLL-derived RS and DLBCL-like RS clinical features are displayed in Table 1. Both RS groups were uniformly treated with rituximab-based chemotherapy regimens, yet with inferior outcome for CLL-derived RS (p = 1.7e−3). This was further confirmed with geneexpression profiling, where RS samples aggregating in the CLL-derived branch of the dendrogram (Fig. 3a) were associated with a median OS of only 8 months. In contrast, RS samples clustering with the DLBCLs were associated with a longer median OS (35.5 months; p = 0.018) ( Supplementary Fig. 18).

CLL-derived and DLBCL-like RS feature different epigenetic networks
To better understand the epigenetic architecture of RS subgroups, we performed an integrative analysis based on correlations between DNAm and gene expression data (see "Methods"). The resulting integrome associated 674,567 transcripts with methylation loci. From these, 63,305 (9.4%) significant correlations (p < 0.01, Spearman's rho <−0.33 and >0.33) were first selected. Compared with DLBCL-like RS, high CLL-derived RS were mainly hypomethylated, which transcribed into a dominant direction of overexpression (Fig. 4a). Matching density maps were observed for high CLL-derived and low CLL-derived RS, with only slight differences. In contrast, DLBCL-like epigenomic programs largely differed ( Supplementary Fig. 19), so we undertook an indepth comparison of their integrome against that of high CLL-derived RS. Significant correlations between the two RS groups accumulated at regulatory locations and were mostly negative (77.3%; Fig. 4b). Genes under the control of these regions were related to cell proliferation (cell cycle, NOTCH pathway, PLCγ-mediated BCR signaling), epigenetic regulation and RNA processing, immune response (T-and B-lymphocyte activation and differentiation), and transcriptional regulation, including STAT family transcription factors (TF). Negative

Methylome and transcriptome integration provide insights into RS regulatory features
Key players of RS epigenetic deregulations were further identified in high CLL-derived RS, using DLBCL-like RS as a reference, and the 861 genes transcriptionally controlled through methylation. Among these, 156 were identified as TFs (18.1%; 2.3-fold enrichment; p < 1e−16) 39 . The regulatory network reconstructed in silico from these genes showed a central role of p53-like TFs and STAT proteins, an extensive control emanating from master regulators such as TP53, NF-KB1, and FOXC1, an essential developmental TF in many tissues which may have a role as a tumor suppressor. Over-represented target genes included those of the transcriptional repressors ZNF418 (6.1-fold; FDR = 1.87e−21) and ZNF217 (2.1-fold; FDR = 1.56e−8), involved in differentiation and antagonizing cell death, respectively. On the network, downstream effectors were mainly involved in epigenetic repression via the polycomb complex Prc2 (Supplementary Fig. 21 and Supplementary Data 8), for which we noted a SUZ12 signature (FDR = 5.68e−4) and an EZH2 target enrichment (FDR = 2.84e−4) in B cells, also linked with H3K9me3, H3K27me, and H3K27me3 epigenetic marks (FDR < 3.85e−6 in GM12878 cell line). The 156 TFs were strongly enriched in KRAB domain/C2H2-ZF-type TFs defining homeobox developmental proteins 40 . We observed P300 favored interactions (4.2-fold increase; FDR = 3.1e−3), denoting enhancers as enriched targets 41 . These results support our previous findings and highlight critical pathway reprogramming through selected epigenetic control of key TFs as an important mechanism in RS.
RS-based classifiers uncover "RS-type" DLBCLs with poor outcome DLBCL histological presentation of RS is essential to be distinguished from de novo DLBCL because they differ greatly in terms of prognosis. We thus developed a gene expression based linear classifier score a Unsupervised hierarchical clustering of RS and de novo DLBCL transcriptomes (RNA-Seq; 23,508 genes). b K-means consensus clustering of RS transcriptomes according to DNA methylation-based LPS gradient. Expression level statistics for each cluster are displayed as barplots. Barplot: data are presented as mean values +/− standard deviation from the mean. Cluster 1: n = 1657 genes; p = 1.29e−5. Cluster 6: n = 2203 genes; p = 2.56e−7. p values were derived from two-sided t tests. Source data are provided as a Source data file. Differential clusters are functionally annotated to the right. Mutational statuses as reported with NGS, or abnormalities determined with CNV analysis on DNAm data, are added below sample annotation for a selected panel frequently described in CLL and RS. c Sample partitioning according to IGHV mutational status. Unsupervised PCA clustering of U-RS, M-RS, U-CLL, M-CLL, and DLBCL according to the 10,000 most variable CpGs in the dataset. The focus is made on the most variable CpGs because these are highly representative of the IGHV signature in CLL (59% of these CpGs are strongly differential between U-CLL and M-CLL). Indeed, PC1 separates IGHV-unmutated from IGHV-mutated B cell malignancies, with U-CLLs and U-RS segregating in the same area. Conversely, M-RS partition with DLBCLs, clearly separated from M-CLLs on PC2. CLL chronic lymphocytic leukemia, COO cell of origin, DLBCL de novo diffuse large B cell lymphoma, DLBCL-like RS DLBCL-like Richter syndrome, e enrichment, EBV Epstein-Barr virus, GCB germinal center B cell, high CLL-derived RS CLL-derived RS with a high LPS, LN lymph node, low CLL-derived RS CLL-derived RS with an LPS score below threshold, LPS linear predictor score, M-CLL IGHV-mutated CLL, M-RS IGHV-mutated Richter syndrome, PC1/2 principal component 1/2, q q-value (corrected p value), RS Richter syndrome, U-CLL IGHV-unmutated CLL, U-RS IGHVunmutated Richter syndrome.  Supplementary Fig. 29). This systematic analysis confirmed strong associations with survival, independently from other covariates. In particular, this shorter survival was unrelated to international prognostic index distribution ( Supplementary Fig. 30). We next explored whether this effect might be due to the enrichment of a previously described genomic subgroup of ABC-like DLBCL associated with unfavorable prognosis 10 . In the 562-sample dataset from Wright and colleagues, the 25 cases with top LCS scores were enriched in formerly unassigned (1.5-fold relative enrichment; p = 0.04) and N1 subgroups (6-fold; p = 6e−3) while depleted in EZB subtype (p = 0.03). These cases were also strongly enriched (6.74-fold; p = 4.1e−4) in samples collected at relapse, raising the hypothesis of the ability of our classifier to identify DLBCL prone to relapse. Thus, the extreme LCS values seemed to characterize a distinct subset of ABCtype DLBCL, accounting for 4.3-8.3% of de novo DLBCL, with poor prognosis. The highest 25% scores in the series from Wright and colleagues showed biased distributions in genomic subgroups, dominated by unclassified cases, and associated with shorter PFS and OS (Fig. 6). These findings suggest an ability of the LCS classifier to: (i) identify high-scoring DLBCL samples as a separate DLBCL entity within de novo DLBCL, associated with ABC phenotypes and other features comparable to RS; and (ii) linearly classify other samples according to survival and overall prognosis ( Supplementary Figs. 29 and 31). Interestingly, while absent from the 215-gene list, the CLL-associated marker CD5 was overexpressed in RS versus DLBCL (2.4-fold; FDR = 2.13e −3) and high CLL-derived versus DLBCL-like RS (2.3-fold; FDR = 0.01). In the dataset from Wright and colleagues 10 , CD5 expression was higher in samples within the top 25% LCS than in other samples (p = 5.8e−7), corroborating our results. Last, in a dataset with concomitant transcriptome and CD5 immunochemistry staining GSE66770, the majority (17/22; 77.2%) of the top 25% samples were CD5+ DLBCL (2.1-fold enrichment) while this proportion was significantly lower (16/68; 23.5%) in the rest of the cohort (p = 4.73e−3).

Discussion
In this study, by using genome-wide DNAm analysis and wholetranscriptome gene expression profiling, we extensively characterized the epigenetic architecture of primary human RS samples. We identified a CLL epigenetic imprint that can act as a surrogate for identifying whether an RS is clonally related to CLL or has arisen de novo. Discovery of the CLL imprint in an RS sample avoids reliance on obtaining tumor DNA at the CLL stage. Considering de novo DLBCL, DNAm-and gene expression-based classifiers delineated an RS-like subset in datasets from several landmark studies that was not previously described at the genomic level, was enriched in cases with an ABC-like COO signature, and had an unfavorable prognosis 7,8,10 .
Previous extensive explorations with exome or full genomesequencing had found differences in genomic landscapes between DLBCL-subtype RS and de novo DLBCL [16][17][18][19][20] . Here we used a different study design and methodological approach to expand this knowledge. Firstly, we studied epigenetic deregulations using robust and proven methods, and profiled the RS molecular landscape beyond gene mutations and copy number variations. Secondly, we conducted a comprehensive analysis of RS pathophysiology which combined the analysis of genome-wide DNAm and whole transcriptome profiling, rather than pinpointing a limited number of specific targets. Thirdly, we compared the RS epigenetic profile to that of large cohorts of diverse CLL and de novo DLBCL, which contrasts with previous work mostly focusing on the RS transformation process. Human-derived xenograft mouse models and cell lines were recently reported to study RS biology and test drug response [11][12][13][14][15] . However, the availability of these models is limited and they cannot recapitulate the full heterogeneity of RS, as they were generated from a limited number of tumor samples. Our approach using large cohorts of primary human RS samples and comparative tumor material also holds promise for discoveries and better characterize the wide RS epigenetic complexity. We cross-validated our epigenetic findings using DNAm patterns, that were largely corroborated by transcriptome data, in an independent manner.
Our genome-wide DNAm data provide a more complete RS hypomethylation profile description. The DNAm patterns confirm previous findings that RS is a DNA-hypomethylated entity as compared with CLL and de novo DLBCL 27  part reflect a more extensive proliferative history of the RS subclone 21 , as measured by the epiCMIT mitotic clock 33 . Using a reproducible DNAm microarray uniformly spanning the vast majority of regulatory regions at a whole-genome scale 37 , we characterized the epigenetic architecture underlying the commonly accepted dichotomic heterogeneity with regard to whether a primary RS is clonally related to CLL or has arisen de novo 17 . As expected, around 80% of our RS samples harbored a CLL epigenetic imprint (likely derived from a pre-existing CLL clone). This was confirmed by identical IGHV-CDR3 sequences for all CLL-RS follow-ups. As nearly all de novo DLBCLs harbor a mutated IGHV, we propose that RS clonally related to the underlying CLL clone are: (i) IGHV-unmutated DLBCL; and (ii) IGHV-mutated DLBCL with a CLL imprint. Determining CLL history using DNAm and gene expression by identifying a CLL imprint independently from matched-CLL availability is a step forward, and is essential for clinical and therapeutic management. Interestingly, DLBCL-like RS would conversely be DLBCL without clonal relationship with the CLL counterpart. However, DNAm of DLBCL-like RS differed from that of de novo DLBCL in terms of increased cell cycle activity and IGF1, ERK/MAPK, PI3K/AKT, and PD-1 signaling pathways. These differences suggest influences of the CLLinvaded microenvironment for the development of a specific DLBCL pathogenesis 43 . Moreover, by integrating the DNAm and transcriptomic data, we evidenced different epigenetic networks in CLL-derived and DLBCLlike RS. Epigenetic architecture remodeling and subsequent deregulation of EZH2 and Wnt pathways, as well as PI3kinase/AKT and IGFR1 signaling cascades, unravel CLL-derived RS underlying mechanisms potentially responsible for chemotherapy resistance. These mechanisms are potentially druggable through EZH2, PI3K/AKT, or IGFR1 inhibitors. IGFR1 pathway triggering was recently described as a resistance mechanism to targeted therapy in CLL 44 . Interestingly, O6methylguanine-DNA methyltransferase MGMT regulatory sequences are hypomethylated and MGMT is consequently overexpressed in CLLderived RS. MGMT promoter hypomethylation status is a known negative prognostic marker in glioblastoma 45 , de novo DLBCL 46 , and an actionable target. This marker is easily assessable in the context of DLBCL diagnosis and routinely used to guide therapeutic decisions.
Our results show B cell-specific TF implication and epigenetic imprint in CLL-derived RS, and emphasize the previously described important role of TP53, FOXC1, NF-KB, and epigenetic regulators in oncogenic mechanisms. Strikingly, genes involved in the regulation of TP53 activity through methylation were overexpressed in CLL-derived RS, confirming the central role of TP53 in clonally-related RS and the primary importance of epigenetic deregulation in the transformation process. An interesting finding of this study is the putative role of the FOXC1 TF in the RS regulatory network. FOXC1 has previously been described as cooperating with HOX family members for orchestrating mesenchymal tissue development, through NF-KB signaling 47 . FOXC1 is PRC2 repressed during hematopoietic development, but frequently derepressed in hematopoietic progenitors in acute myeloid leukemia 48 . Our data identified FOXC1 derepression as a hallmark of CLL-derived RS, likely associated with the blockade of B cell development and proliferation due to NF-KB signaling unleashing. We also observed hypomethylation of DMRs regulating the expression of genes involved in the extracellular matrix organization, and in the immune system. These observations suggest a strong influence of the microenvironment in RS development.
Notably, our findings directly translate into classification and prognostication of de novo DLBCL, the most common human B cell lymphoma. We provide a gene expression-based, stable, reproducible, and potentially widely applicable classifier, on the basis of a CLLderived RS epigenetic imprint. The classifier differentiates a particular DLBCL subgroup from supposedly de novo DLBCL datasets. Of clinical importance, cases assigned to this subgroup are frequently not detected by recently described genomic and gene expression classifiers of DLBCL, and they are associated with an unfavorable prognosis. These cases were ABC-like DLBCL, enriched in unclassified or N1 DLBCL genomic subtypes 7,10 . This is in line with the association of RS with a particular gene expression profile and with NOTCH1 mutations and NOTCH pathway activation. Given the efficacy of ibrutinib plus R-CHOP chemotherapy in N1 subtype DLBCL 49 , the enrichment in N1 profile within RS samples supports research into whether these patients may also benefit from BTK inhibition combined with R-CHOP chemotherapy. However, a recent single cell transcriptome analysis of sequential CLL-RS samples revealed that, as compared to the CLL cells, RS cells downregulate genes related to BCR signaling and upregulate those involved in oxidative phosphorylation 21 , and therefore RS may be less sensitive to Ibrutinib. By applying a stringent cut-off to our transcriptomic score, generalized to all studied DLBCL datasets 8-10,42 , we identified a separate de novo DLBCL subset associated with a median PFS comparable to that of clonally-related RS. Based on our observations, 4-8% of DLBCL diagnosed as de novo DLBCL, nonotherwise specified, may in fact be a subgroup of DLBCLs sharing common epigenetic and transcriptional features with clonally related RS, and with a similar unfavorable outcome. We propose a stable and reproducible expression-based classifier widely applicable to transcriptomic data, enabling the identification of this specific entity within supposedly de novo DLBCL, termed "RS-type DLBCL." Limitations of the transcriptome scoring method are dataset size and composition (DLBCL features associated with outcome), which by design prevent the exploration of single samples independently and may exert biases. However, the method also demonstrated the linear association of DLBCL scores with poor outcome and clinical variables of cancer aggressiveness, and so constitutes a means for improving the current DLBCL classification system.
In conclusion, our study has revealed several relevant aspects of RS biology, including the complete RS hypomethylation profile and differentiation of clonal versus non-clonal RS according to DNAm patterns and gene expression profiles. The discovery of a CLL imprint allows clonal relationship assessment without the need for tumor DNA at the CLL stage. Subgrouping of primary RS samples according to extensive characterization of the epigenetic architecture has provided information underlying oncogenic processes, with clear clinical implications. In particular, identification of RS-type DLBCL cases helps to advance the current DLBCL classification system and could be incorporated in treatment decisions, potentially improving disease management. Our findings also enable the evaluation of larger cohorts recruited in clinical trials and the development of novel treatment approaches, which are urgently needed in RS.

Methods
Our methods and results made extensive use of data from previous landmark studies 26,31 . Care was taken to follow good practices in the analyses of methylome and transcriptome data, employing widely approved procedures previously used in other high-standard studies. Regarding the handling of large cohorts, we used sample correlations, performed genotype checks between omics data, and added technical and biologic replicates wherever possible.

Patients and materials
A multicenter registry of RS accrual was established across nine centers affiliated to the French Innovative Leukemia Organization (Clinical-Trials.gov Identifier: NCT03619512). Sixty-four patients diagnosed with DLBCL-subtype RS were enrolled. Fresh frozen biopsies were gathered at RS diagnosis and met the criteria for DLBCL, including diffuse patterns of large B cells with the same size as macrophages or twice the size of normal lymphocytes 3,50 . For all patients, diagnoses were reviewed and confirmed by two independent pathologists. Only RS samples with at least 50% (median 80%, range 50-95%) high-grade component assessed by pathology review were selected for analysis. The same process was applied for assembling a validation cohort of 58 samples, further reduced to 52 QC-passed samples, which we processed to an independent EPIC 850K experiment. Fifty-eight of the 64 enrolled patients with RS were from a previously described cohort, and both targeted NGS sequencing and DNAm exploration were performed; 56 of these 58 patients with RS underwent 18F-fluorodeoxyglucose positron emission tomography/computed tomography for initial diagnosis 51 . For the other six patients with RS, the fresh frozen biopsy was too small for extracting both DNA and RNA, and due to the large cellular component (>70%), we prioritized gene expression data and only RNA sequencing was performed. The minimal tumor purity was raised to 70% for RNA analysis, as contamination by signal from residual normal cells strongly influences global gene expression, especially for a subset of transcripts with very low expression in tumor cells but high expression in residual normal cells.
Additional data for CLL (n = 215), and 92 normal B cells spanning the entire B lineage development were obtained as part of previously published studies 22,25,26,34,35 . DNA methylation from 68 de novo DLBCL cases were also used as a reference. These DLBCLs originate from a larger lymphoma cohort gathered by the ICGC MMML-seq consortium 52 . Finally, 10 lymph nodes from healthy subjects were analyzed as a control group for transcriptome sequencing.

Methylome data analyses
EPIC microarray. DNAm status of 866,562 CpG sites was interrogated on the Infinium Methylation EPIC array (Illumina, San Diego, CA, USA; see Supplementals), later referred to as the EPIC 850K platform.
Dataset generation. Datasets were created using the minfi package 53 . The EPIC set comprises 90 distinct samples (58 RS, 25 CLL, plus a subset of 7 DLBCL replicates also available on 450K), interrogated on EPIC 850K. DNA methylation data from the control groups (215 CLL, 68 DLBCL, 92 normal B cells spanning the entire B-lineage) were acquired with the Illumina Infinium® HumanMethylation450 BeadChip (later referred to as the 450K platform) 22,25,26,34 . These and the EPIC 850K data were processed from IDAT files. Analyzes were run under R 3.6 with Bioconductor 3.10 and later versions. The FULL dataset comprises 433 distinct samples (92 benign B cells, 215 CLLs, 68 DLBCLs and 58 RS), combined into a single 450K object containing probes shared by 850K and 450K microarrays: (i) raw IDAT files corresponding to 96 and 377 samples for the 850K (866,091 CpGs) and 450K (485,512 CpGs) platforms, respectively, and included technical replicates; (ii) each subset was loaded independently, stored into a dedicated RGChan-nelSet minfi object, along with full sample annotations, then both were combined into a third subset containing 473 samples × 452,567 CpGs using the combineArrays function with output type as "IlluminaHu-manMethylation450k"; (iii) the EPIC dataset stems from the first (850K) subset alone, the FULL dataset is obtained from the combined subsets.
To reduce technological issues and biases, the same preparation protocol was applied to both EPIC (850K) and FULL (combined) subsets. The main stages of the filtering and quality control pipeline are as follows: (i) technical checks, filtering, and evaluations (ii) data normalization with SWAN 54 ; (iii) probes located on X and Y chromosomes, flagged as cross-hybridization probes, or located near known SNPs were further removed with the rmSNPandCH function (with parameters dist = 2 and mafcut = 0.05) available from the DMRcate package 55 ; (iv) imputation of the remaining failed β-value positions with imputePCA of the R missMDA package 56,57 , (v) 2 × 2 sample correlation checks (Supplementary Figs. 32 and 33). Correlation heat maps were rendered with the R corrplot package; (vi) extended quality control step to remove sample outliers and check for residual postnormalization batch effects ( Supplementary Fig. 34); (vii) ultimately, technical replicates were averaged into unique samples as all replicates were found comparable ( Supplementary Fig. 35). These filtering steps led to the final EPIC (90 samples × 794,927 CpGs) and FULL (433 samples × 397,769 CpGs) datasets.
Technical checks, filtering, evaluations, and quality control. These steps included failed CpGs removal (>10% samples with a detection p value >0.01), gender check between clinical data and gender returned by the getSex function, and genotype checks (Supplementary Data 10) between RNA-seq data (see Supplementals) and genotypes inferred with the beta2genotype function available from the R OmicsPrint package 58 .
Cell composition deconvolution. Cell type composition was estimated for each sample with the estimateCellCounts function against a library of 6 normal white blood cells (CD8 T cells, CD4 T cells, NK, B cells, monocytes, and granulocytes) (Supplementary Data 11). The proportions of each explored cell type were reported and later used as covariates in statistical models to adjust for B cell representation in the mixes. Blood samples deprived in B cells (<30%) were thus discarded from further analyses.

Downstream bioinformatics
Supervised analyses. As a rule, β-values were used for direct interpretation and graphical representation, while M-values were favored for statistics and computations. Linear modeling based on empirical Bayesian methods was used to assess for CpG differential methylation. When applicable, these models included cell deconvolution results as added covariates to correct for B cell content. Additionally, at this point, any unwanted methylation variation such as residual batch effects were removed by using the RUVm function from package missMethyl 59 . The overall dispersion was calculated on the entire dataset, then p values for each comparison were obtained with a twoway moderated t test and adjusted for FDR following the Benjamini-Hochberg procedure. At probe level, an FDR < 0.01 indicated statistical significance. Differentially methylated region (DMR) determination was performed on the same linear models with dmrcate (package DMRcate), with lambda = 1000 and C = 3. FDR cut-off for first allowing a CpG to initiate a DMR was set to FDR = 0.01, and DMRs were considered statistically significant if both min_smoothed_fdr and HMFDR output probabilities were <0.01.
Unsupervised analyses. Explorations were conducted on β-values, and all methods used Euclidean distances as (dis)similarity metrics. PCAs were performed with R packages FactoMineR and factoextra, on the entire datasets or a subset of the top variant CpGs across all considered samples. Hierarchical clusterings included complete and average linkage criteria, and resulting heat maps and dendrograms were rendered with the R package ComplexHeatmap. Non-Negative Matrix Factorizations were performed with the R package NMF, either on all CpGs or a subset of the most variant ones according to the context, with method lee, and parameters ranking = 3 and iterations = 50.
Feature annotations. Methylome data were analyzed using the available Illumina 450K and EPIC platform annotations, which strongly rely upon the hg19 assembly. As several tools like minfi and DMRcate still use those by default, CpG and DMR locations/annotations were lifted to hg38 coordinates as an after-computation-process when required, especially when dealing with integrations with transcriptomic and epigenomic data. Additional CpG annotations included B cell development modules 34 , UCSC tracks for the EBV-transformed GM12878 cell line, such as DNaseI and chromatin marks from ENCODE and transcription factor ChIP-Seq peaks from ENCODE3, histone modifications, and chromHMM chromatin states for 7 reference CLL epigenomes (2 U-CLLs and 5 M-CLLs) 26 . Any liftOver of coordinates between hg19-and hg38-annotated data was achieved with the UCSC table browser or with R packages liftOver and XGR.
Gene set and pathway analyses. Unbiased functional annotations on ontological terms (GO) and KEGG pathways were achieved at CpG and DMR levels with the R package missMethyl. Additional enrichment analyses were conducted on curated gene lists with Enrichr 60 . Reactome pathway overrepresentations and enrichment analyses were performed with ReactomePA 61 on curated sets of unique genes associated with identified DMRs. Enrichments in sets of CpGs, DMRs, target genes, GO terms, or pathways were calculated as the occurrences of the selection against a background representing the entire dataset (enrichment = observed frequency/expected frequency). p values associated with enrichment analyses were obtained with (i) an overrepresentation test, (ii) Fisher's exact test, or (iii) a Chi-square test, depending on the context and group size.
Linear predictor score (LPS). To formally distribute RS samples into subgroups, we developed a scoring predictor inspired by the work of Wright and colleagues on transcriptome data, that successfully separated GCB from ABC DLBCL 38 . Here, we applied LPS on methylome data, with a cohort composed of all 58 RS samples, 215 CLLs, and 68 DLBCLs. As we aimed to best discriminate between CLL and DLBCL profiles, only the highly differential CpGs between the two groups were considered in the analysis (261,085; FDR < 0.01; moderated tstatistics were retained for further use in the score computation). To lessen the impact of B cell IGHV maturity on the scoring model, we next subtracted CpGs that were also differential between U-CLL and M-CLL (128,408; FDR < 0.01). The 181,231 remaining CpGs were then filtered into 4863 CpGs with high methylation differential (beta-value differential or β-Fold-Change), that is, >30%. This amount was considered appropriate as: (i) statistical power to discriminate such a methylation differential was reached; (ii) probe composition was balanced between regulatory region/gene body/intergenic location as compared to the background; (iii) it provided sufficient number to expect a normal distribution of LPS within subgroups; and (iv) those CpGs demonstrated a strong correlation structure among the groups of samples (Fig. 2e).
Finally, from each of these 4863 CpGs and for each sample S of the cohort, the score was calculated, with t i representing the moderated t-statistic for CpG i and S i the corresponding methylation β-value. Known score distribution of CLL and DLBCL samples within their respective subgroup G 2 CLL,DLBCL ½ allowed the Bayesian likelihood approximation for RS samples S to belong in each one of them, with probability and PðS in G = CLLÞ ' 1 À PðS in G = DLBCLÞ where Φ computes the normal density function with the estimated meansμ and variancesσ 2 of LPS within either subgroup G. To finally obtain highly specific and homogeneous subgroups, thresholds were defined as follows: (i) p(S in G = CLL) ≥ 0.98 for CLL-derived (namely p CLL-derived ), and (ii) p(S in G = DLBCL) ≥ 0.98 for DLBCL-like (p DLBCL-like ) labeling, since the gray zone between the two main groups is centered on scores for which the probability density functions overlap at values >0.02 either way ( Supplementary Fig. 10).

Transcriptomics
All samples were processed within the same batch. Demultiplexed single-end sequencing data corresponding to 50-nucleotide-long reads were available in FASTQ files, one for each of the 87 samples, and used in the next processing steps. The cohort was composed of 47 RS samples, 2 paired-CLLs, 28 DLBCLs, and 10 controls from normal lymph nodes. The 10 normal controls were added for methodologic purposes: (i) normalization; (ii) checking benign profiles against B cell malignancies; (iii) checking the feeble amplitude of the transcriptomic component separating inflammatory (n = 3) from non-inflammatory lymph nodes (n = 7); and (iv) validation of the efficiency of the developed scoring methods. Raw abundance filtering and normalization. Raw counts were filtered by applying a minimum expression threshold for a gene or transcript. Those had to be expressed (non-zero value) in at least two samples and present an average expression value across all samples higher than 1/5,000,000 of the average library size (64 ± 3 million reads per sample), that is, at least 20 reads per feature. Data was further adjusted with the TMM normalization method 64 , and finally was log2 and cpm (count per million) transformed 65 . A total of 23,508 genes and 77,491 transcripts were identified and reported at the end of the process. Pearson's correlations for gene expression levels averaged at 0.92 for genes and 0.75 for transcripts and were very stable across samples (data not shown).
Gene and transcript annotations. All transcriptomic analyses were performed using the hg38 reference assembly of the human genome. Results were fully annotated with known symbols corresponding to gene and transcript genomic locations whenever possible. Upon completion of the transcript assembly, gene symbols were assigned Ensembl IDs based on overlapping positions with known transcripts (90% overlap minimum). In case of failed overlap, custom and unique IDs were used. Therefore, gene and transcript assignments were based on the Ensembl 66 GRCh38 annotations available in both core and funcgene databases, version 90. These were downloaded from ftp:// ftp.ensembl.org/pub/release-90/mysql/ for local installation and query with in-house custom tools).
Transcriptome explorations. Unsupervised analyses were all carried out with hierarchical and K-means clustering techniques, as previously described 67 . Expression values were median-centered, and uncentered Pearson's correlation was used as distance metrics. Supervised analyses were performed through linear modeling (empirical Bayes), and differential expression p values were obtained using a two-way moderated t test then adjusted for FDR following the Benjamini-Hochberg procedure. An FDR < 0.01 indicated statistical significance. Cluster dissection was achieved with functional annotation tools for target gene associations, such as the Open Targets platform 68 , and gene signature correlation with public datasets from multiple databases, such as GEO (Gene Expression Omnibus), with Enrichr [https:// maayanlab.cloud/Enrichr].

Methylome and transcriptome data integrations
Here we focused on the RS cohort, for which 41 RS samples overlapped between methylome and transcriptome experiments. A subset of the methylome EPIC dataset (M-values, normalized and curated) and part of the transcriptome dataset (gene and transcript CPMsalso normalized and filtered) were integrated to eliminate unwanted signals and pinpoint the functional mechanisms linking DNA methylation of regulatory regions with gene expression in RS. Both datasets were re-annotated with biomaRt 69 and linked using two methods: (i) with shared Ensembl identifiers; and (ii) by genomic coordinates for refined feature overlap when the first method failed. We used "TSS200," "TSS1500," and "first exon" CpG information to define associations with promoter regions in the next analysis steps, and overlap was considered successful within 2 kb between CpG and gene transcription start sites (TSS). The integromes generated at this step represented 475,148 and 674,567 associations at the gene and transcript levels, respectively. As described in a similar setup 70 , Spearman's correlations were calculated for each association. Correlations at the gene level were used for generating density plots and presenting a general view, whereas transcripts were used for precise analyses and final results. These were filtered into candidate transcriptional effector locations, by selecting "promoter regions" containing at least three negativelycorrelated CpGs (rho < −1/3; p value <0.001) or three positively correlated CpGs (rho > 1/3; p value <0.001) with features corresponding to "TSS200," "TSS1500," "first exon," or "TSSoverlap2kb" (each linked to the same transcript identifier).
Manhattan representations were plotted against the background with the R CMplot package. Gene set enrichment and pathway analyses of selected candidate lists were carried out as described in Methylome data analyses. Interaction networks of putative TFs encoded by candidate genes, protein domain enrichments, and effector functions were performed with STRING tools [https://string-db.org/] 71 . A curated database of 1639 human TFs with DNA-binding domain information was obtained from http://humantfs.ccbr.utoronto.ca/. Regulatory networks were built with NetworkAnalyst [www.networkanalyst.ca] 72 .
Methodology for building the gene expression-based scoring system CLL-derived RS signature. A 215-gene set was obtained by extracting two clusters of strongly correlated up-and down-regulated profiles from the transcriptome hierarchical clustering tree (Fig. 3a, Supplementary Fig. 36, and Supplementary Data 9). The two initial clusters displayed a very high enrichment in CLL genes and mainly drove the whole sample aggregation process. These were further reduced to protein-coding genes, to avoid biases when applying the signature to transcriptomes of different origins, which may not contain ncRNAs or genes of undefined biotype. The reduced set was then overlapped with genes integrating significantly between transcriptome and methylome. The resulting 215-gene signature contained 93 protein-coding genes underexpressed in CLL-derived RS and 122 protein-coding genes overexpressed in CLL-derived RS.
Linear classifier score (LCS). For each analyzed dataset, scores were obtained according to the following procedure, to render the process as reproducible as possible. (i) When applicable, raw expression data with relevant sample annotations were retrieved from the Gene Expression Omnibus curated database (https://www.ncbi.nlm.nih.gov/ geo/) with GEOquery 73 . Expression matrices were then prepared, described statistically, and normalized according to a well-established protocol 74 . Otherwise, already normalized expression data were used "as is". (ii) Whole transcriptomes were reduced to their features (genes, transcripts, probes) corresponding to matches with the 215-gene signature. (iii) Expression values were summed up over genes to obtain an aggregated and unique expression for each gene. (iv) Data were scaled, i.e., mean-centered and standard-deviation-reduced. (v) Positive outliers were trimmed at the last permille (99.9%) to reduce the impact of extreme gene expression values on the score but preserve high enough values as essential markers. Trimmed values were replaced with the last permille value. After a distribution check, no negative outliers were found in any dataset. (vi) For each of the 215 genes, weights were assigned: those originating from the upregulated cluster were weighted +1 and those originating from the downregulated cluster were weighted −1. (vii) Finally, LCS scores were computed as the mean of weighted gene expressions for each sample S of the dataset: with n the number of genes in the signature, and G i representing the gene i weighted by W i . LCS scores were then standardized (meancentering to 0 and standard-deviation-reducing to obtain scores fully comparable between datasets). The obtained Z-scores were compared to a normal distribution in a one-way test to calculate a p value, used to define the initial LCS cutoff (p < 0.05) in each dataset.

Statistics and reproducibility
No statistical method was used to predetermine sample size. Data exclusion criteria according to quality controls are explained in the "Methods" section. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
Raw DNA methylation, gene expression and targeted NGS data generated in this study from RS samples have been deposited in the European Genome-Phenome Archive (study EGAS00001005495) under accession number EGAD00010002194 for DNA methylation data; accession number EGAD00001007922 for transcriptomic data, and accession number EGAD00001009509 for targeted NGS data. The raw data are protected and available under restricted access. Clinical and genomic data can be obtained by contacting the data access committee, according to the European Genome-Phenome Archive's procedure. Data access will be granted if their use complies with the data use conditions, including a commitment to strictly use these data for a clearly identified academic research programs and according to good practice recommendations. The Data Access Committee will respond to requests within 2 weeks. Once access to the data is granted, these are available until the end of the research program they support. Previously published DNA methylation datasets from the ICGC MMMLseq consortium that were used in this study are available upon request from the data access committee at the ICGC consortium data portal [https://dcc.icgc.org/]. Published datasets can be found under the following accession codes: GSE103265; GSE66770; GSE10846; GSE98588; GSE87371. All other data supporting the findings of this study are available from the corresponding authors upon request. Source data are provided with this paper.

Code availability
The source code developed for this study for designing the DNam and gene expression classifiers and the methylome-transcriptome integrative analyses is available on the GitHub platform, [https:// github.com/zetcheuv/RichterOmicsCode]. All other source data supporting the findings of this study are available from the corresponding authors.